Learning Models and Real-time Speech Recognition

Douglas G. Danforth, David R. Rogosa, 
and Patrick Suppes 

1. INTRODUCTION AND THEORY

In October of 1972, the decision was made at the Institute for
Mathematical Studies in the Social Sciences (IMSSS) to use psychological
learning models on the problem of computer recognition of human speech.
In this investigation of speech recognition we used the standard home
telephone as an inexpensive terminal for verbal communication in dealing
with an educational curriculum, such as mathematics. In subsequent
pages we describe mathematical learning models and some of their
properties, their implementation as part of a speech recognition system,
and a series of system experiments with children as subjects.

Speech recognition can be viewed as three separate processes: (a)
the internal representation of each utterance, (b) the actual
recognition process, and (c) the change of the internal representation
upon the discovery of errors by the recognition process (learning). In
our approach, the representation of each utterance is given by a vector
of numbers U. These numbers are the digitized amplitudes and frequencies
from three band-pass filters that take as input the analog signal from,
say, a telephone (see Section 2). In this study we deal only with
recognition of individual phrases, and consequently, each utterance may
be normalized in time to a fixed length, 0.50 sec. Our recognition
process utilizes what can be called the nearest neighbor approach. A
metric (see below) is introduced into the space and a distance is
calculated from the unknown utterance U to each of the members of a set
of vectors {V} representing known phrases; the name of the V closest to
U is assigned to U.

*This research was supported by National Science Foundation Grant
EC 443XA.

1.1 Theta Process

Upon the discovery that U was misclassified, the correct vector V
is updated using the following learning model. Let $0 < \theta < 1$ be
an arbitrary scalar parameter and let V be the old vector representing
the word from which U is a sample. Then a new V representation can be
constructed from a weighted average of U and the old V, namely,

$$ V \leftarrow (1-\theta)\,V + \theta\,U. \tag{1} $$

Note that as $\theta$ ranges from zero to one the new representation
ranges from V to U. This model, called the theta process and patterned
after psychological models developed by Bush and Mosteller (1955) and
Estes and Suppes (1959), is one aspect of our learning approach to
speech recognition. Let us now investigate some of the properties of
this linear learning model. In what sense does V 'represent' a word? If
U is considered a random sample from a population with mean vector
M = EU, where E stands for expectation, then

$$ EV \leftarrow (1-\theta)\,EV + \theta\,EU. \tag{2} $$

If we initialize the representation V to the first-heard utterance from
the population, then by a simple inductive argument we find that EV = M
too, so that V is an unbiased estimate of the mean of the population to
which U belongs. It is well known that the sample mean is also an
unbiased estimate of the population mean. However, V has the property
of giving greater weight to recent utterances than to earlier ones.
This responsiveness of V is useful in providing a more accurate
representation of the speaker's current pattern of speaking.
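The update of Equation 1 can be illustrated with a few lines of Python.
This is only a minimal sketch: the function name theta_update and the
use of plain lists for U and V are illustrative assumptions, not part of
the original SAIL programs.

```python
def theta_update(v, u, theta):
    """Return (1 - theta) * v + theta * u, applied componentwise (Equation 1)."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [(1.0 - theta) * vi + theta * ui for vi, ui in zip(v, u)]


# Hypothetical three-component vectors, for illustration only.
v = [1.0, 2.0, 3.0]
u = [3.0, 2.0, 1.0]
print(theta_update(v, u, 0.0))   # theta = 0 leaves V unchanged
print(theta_update(v, u, 1.0))   # theta = 1 replaces V by U
print(theta_update(v, u, 0.25))  # intermediate theta gives a weighted average
```

Repeated application weights the k-th most recent utterance by a factor
proportional to (1 - theta)^k, which is the sense in which recent
utterances count more than earlier ones.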

It is of interest to consider the distance of a sample U of the
population to its representation vector V so as to determine the
likelihood of correct classification. Let d(U,V) be this distance and
Ed(U,V) its expected value. If we assume a Euclidean metric and
independent, identically distributed (i.i.d.) random samples U, it can
easily be shown that this distance on the nth trial is given by

$$ Ed(U,V) = 2\,\frac{1 + (1-\theta)^{2n-1}}{1 + (1-\theta)}\,Ed(U,M), \qquad n = 1, 2, \ldots \tag{3} $$

where Ed(U,M) is unknown but independent of $\theta$ and trial number.
Thus we have an expression for the expected distance between a member
of the population and its representation vector V. Figure 1 shows
explicitly the functional form of Ed(U,V)/Ed(U,M) for the cases where
n=10 and n=50.

Insert Figure 1 about here

Equation 3 gives the expected distance as a function of n and $\theta$.
The minimum of Equation 3 may be obtained by setting its derivative
equal to zero. By fixing n and solving for $\theta$ in this expression,
the values presented in Table 1 were obtained. Note that n set to zero
signifies V = U initially. This table is considered later in the
experimental sections with regard to the error rate of classification.

Insert Table 1 about here
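The dependence of Equation 3 on n and theta can be explored numerically.
The sketch below rests on stated assumptions: d is taken to be the
squared Euclidean distance, the utterances are i.i.d. with unit variance
per component, and trial 1 is the trial on which V consists of a single
utterance. The function names are illustrative. Because the trial
indexing behind Table 1 is not fully specified in the text, the sketch
is not claimed to reproduce the tabulated values.

```python
def distance_ratio(theta, n):
    """Ed(U,V) / Ed(U,M) according to Equation 3, for trial n = 1, 2, ..."""
    a = 1.0 - theta
    return 2.0 * (1.0 + a ** (2 * n - 1)) / (1.0 + a)


def ratio_by_recursion(theta, n):
    """The same ratio obtained from the variance recursion implied by Equation 1."""
    var_v = 1.0                      # trial 1: V is a single utterance
    for _ in range(n - 1):           # each later trial applies one theta update
        var_v = (1.0 - theta) ** 2 * var_v + theta ** 2
    return 1.0 + var_v               # E|U - V|^2 = Var(U) + Var(V), U independent of V


def minimizing_theta(n, steps=1000):
    """Grid search over theta in (0, 1] for the minimum of Equation 3 at fixed n."""
    grid = [k / steps for k in range(1, steps + 1)]
    return min(grid, key=lambda t: distance_ratio(t, n))


# The closed form and the recursion agree (up to rounding) for any theta and n.
print(distance_ratio(0.2, 5), ratio_by_recursion(0.2, 5))
print(minimizing_theta(10))
```

A grid search is used instead of solving the derivative equation in
closed form; it is cruder but makes no further assumptions.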
Fig. 1. The above plot displays the existence of very pronounced
minima at $\theta$=0.166 for n=10 and at $\theta$=0.051 for n=50. Minima
such as these occur for each trial number n. When $\theta$ assumes one
of these minimizing values, it is reasonable to believe that the
probability of an utterance being correctly classified as V is
maximized.
TABLE 1

Values of Theta Which Minimize the Expected Distance
as a Function of n

  n  |  Theta
 ----+--------
  1  |  1.000
  2  |  0.472
  3  |  0.387
  5  |  0.252
 10  |  0.166   (see experiment 2A)
 15  |  0.126
 20  |  0.103
 25  |  0.08?
 30  |  0.075
 35  |  0.068
 40  |  0.061
 45  |  0.056
 50  |  0.051   (see experiment 1)
1.2 Delta Process

The theta process is essentially an estimate of the first moment of
the population. In the standard problem of statistical classification
(Anderson, 1958), estimates of the covariance matrix are necessary to
determine a hyperplane separating two populations. In order to avoid
the inversion of a full covariance matrix, which is necessary with the
classical Bayesian procedure, one may use other less precise but
computationally more efficient techniques. One of these, which we call
the delta process, estimates the variances of the utterance components.
Let $\delta$ (delta) be a parameter that lies in the interval [0,1];
then $S^2$ given by the learning equation

$$ S^2 \leftarrow (1-\delta)\,S^2 + \delta\,(U - V)^2 \tag{4} $$

is an estimate of the component variances, where U, V are as before.
Using the two quantities V and $S^2$ we may calculate a 'distance'
between the utterance U and a representation vector V by

$$ D(U,V) = (U-V)^{T}\,W\,(U-V), \qquad (T = \text{Transpose}) \tag{5} $$

where

$$ W = \frac{A}{\operatorname{Tr} A}, \qquad (\operatorname{Tr} = \text{Trace}) \tag{6} $$

and

$$ A = \left(\operatorname{diag} S^2\right)^{-1}, \qquad \operatorname{diag}(S^2) = \begin{pmatrix} S_1^2 & & & 0 \\ & S_2^2 & & \\ & & \ddots & \\ 0 & & & S_n^2 \end{pmatrix}, \tag{7} $$

which differs from the Euclidean distance d(U,V) by the replacement of
I (the identity matrix) by W. Notice that components with high
variability are weighted less than those with low variability.
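A compact way to see how Equations 4 through 7 fit together is the
following Python sketch. It assumes that A is the inverse of diag(S^2),
which is what makes high-variance components count less, and that every
component variance is strictly positive. The function names are
illustrative and do not come from the original programs.

```python
def delta_update(s2, v, u, delta):
    """Equation 4: running estimate of the component variances."""
    return [(1.0 - delta) * s + delta * (ui - vi) ** 2
            for s, vi, ui in zip(s2, v, u)]


def variance_weights(s2):
    """Equations 6 and 7: diagonal of W = A / Tr A, with A assumed to be (diag S^2)^-1."""
    inverse = [1.0 / s for s in s2]      # assumes every variance is > 0
    trace = sum(inverse)
    return [x / trace for x in inverse]


def weighted_distance(u, v, w):
    """Equation 5 for a diagonal W: weighted sum of squared component differences."""
    return sum(wi * (ui - vi) ** 2 for wi, ui, vi in zip(w, u, v))


# Hypothetical two-component example: the second component is four times as variable.
w = variance_weights([1.0, 4.0])
print(w)
print(weighted_distance([1.0, 2.0], [0.0, 0.0], w))
```

Because W is diagonal, no matrix inversion beyond componentwise
reciprocals is needed, which is the computational saving the delta
process is meant to provide.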
1.3 Beta Process

Alternatively, we may introduce the concept of a strength associated
with each component of V and then increase or decrease its value
depending upon whether that component correctly or incorrectly
classifies an utterance. Let L be a vector of strengths associated with
V. Then $L_i$ can be changed by multiplying by a quantity $\beta_i$
(Beta) such that

$$ L_i \leftarrow \beta_i\,L_i. \tag{8} $$

Thus, $\beta_i > 1$ if i is a good component and $\beta_i = 1$ if it is
bad (Eq. 10). The weights subsequently associated with the components
of V are related to the strengths through normalization, namely

$$ W = \frac{A}{\operatorname{Tr} A}, \tag{6'} $$

where

$$ A = (\operatorname{diag} L). \tag{7'} $$

Again the distance between utterance U and a representation vector V is
given by

$$ D(U,V) = (U-V)^{T}\,W\,(U-V). \tag{9} $$

A good component is defined when an error in classification has
occurred. Let V' be the incorrectly chosen representation vector and V
the true vector with which U should be identified. Then component i is
good if

$$ |U_i - V_i| < |U_i - V'_i| \tag{10} $$

and bad otherwise. Changing the strengths by Equation 8 is thus the beta
process, which has been studied extensively in Lamperti and Suppes
(1960). We call the combined processes (theta, delta) the delta model
and those of (theta, beta) the beta model.
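The beta process can be sketched in Python as follows. The criterion
used here for a 'good' component (the true vector is closer to the
utterance than the wrongly chosen vector in that component) is an
assumed reading of the definition in this section, and the function
names are illustrative.

```python
def beta_update(strengths, u, v_true, v_wrong, beta):
    """After a misclassification, multiply the strength of each good component by beta."""
    updated = []
    for strength, ui, vt, vw in zip(strengths, u, v_true, v_wrong):
        good = abs(ui - vt) < abs(ui - vw)   # assumed criterion for a good component
        updated.append(strength * beta if good else strength)
    return updated


def strength_weights(strengths):
    """Normalize strengths so the weights sum to one (W = A / Tr A with A = diag L)."""
    total = sum(strengths)
    return [s / total for s in strengths]


# Hypothetical example: component 0 favors the true vector, component 1 does not.
new_l = beta_update([1.0, 1.0], [1.0, 1.0], [1.0, 5.0], [3.0, 1.0], 2.0)
print(new_l)
print(strength_weights(new_l))
```

With beta set to one the strengths never change, so the weights stay
uniform and the beta model reduces to the theta process alone.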

1.4 Internal vs. External Learning Models

Our use of the delta and beta models is at variance with what is
usually done in the psychological investigation of human learning. A
task is presented to subjects, and a mean learning curve is obtained by
measuring the average number of correct responses as a function of the
presentations of the task (trial number). A theoretical model is then
proposed as a possible explanation for this correct response curve, and
the parameters of the model are estimated from the data. We may
consider such models 'external' models. In contrast, we specify
explicitly the internal response processes. Consequently, the delta and
beta models, as used here, may be considered 'internal' models. The
theoretical link between the internal-external responses of the machine
is suggested in Sections 3.4 and 4.3 through the comparison of the
minimum expected distance of an utterance to its representation vector
and the measured error rate of experiments 1 and 2. Further theoretical
investigation of this link is underway. Sections 4.2 and 4.3 discuss
the application of an external model to the learning curves of
experiment 2A.

2. IMPLEMENTATION 

An overview of our speech recognition system is presented
pictorially in Figure 2. A call placed from a standard home telephone
to an Institute number is automatically coupled with an Institute
high-speed line that feeds the analog signal to our hardware filters.
These filters are patterned after those used in Vicens (1969, 1970) and
consist of three solid-state band-pass filters whose ranges were chosen
to approximate the human formant structure: 150-900 Hz, 900-2100 Hz,
and 2100-5000 Hz, respectively. Since the telephone frequency response
is in the range 300-3000 Hz, our filters adequately span this interval.
The output from each of these detectors then is amplitude and frequency
sampled at 10 msec intervals and the digitized results are shipped by
high-speed line to our PDP-10. This is all done in real time.

Insert Figure 2 about here

The raw, digitized utterance data flows into an internal buffer 
until the hardware stops transmitting, which occurs whenever the input 
analog signal falls below a hardware specified threshold for longer than 
a hardware specified time. The buffer is dumped when the flow of input 
data ceases. The dumped data are then reformatted and time and ampli- 
tude normalized for return to the recognition programs in a convenient 
standardized form. The form is a vector of 300 numbers, (3 amp + 3
freq)*(100 samples/sec)*(1/2 sec).

The recognition process simply entails calculating the distance
from utterance vector U to each representation vector V of the
vocabulary, that is, calculating the weighted sum of squares of
component differences. The word with the minimum distance is deemed the
best choice. The recognition rate is such that some 30 words per CPU
second can be compared. In experiment 2B, where a 14-word vocabulary is
used, actual recognition times in a time-sharing environment and optimal
recognition times are comparable (about 1/2 sec).

Fig. 2. Overview of speech recognition system.
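The nearest-neighbor recognition step described in this section can be
sketched in Python. The dictionary layout (word mapped to a pair of
representation vector and weight vector) and the name recognize are
illustrative assumptions; the real system works on 300-component
vectors.

```python
def recognize(u, vocabulary):
    """Return (word, distance) for the representation vector closest to utterance u.

    vocabulary maps each word to (v, w): its representation vector and the
    diagonal weights used in the weighted sum of squared component differences.
    """
    best_word, best_distance = None, float("inf")
    for word, (v, w) in vocabulary.items():
        distance = sum(wi * (ui - vi) ** 2 for wi, ui, vi in zip(w, u, v))
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance


# Hypothetical two-word vocabulary with two-component vectors.
vocabulary = {
    "yes": ([0.0, 0.0], [0.5, 0.5]),
    "no": ([1.0, 1.0], [0.5, 0.5]),
}
print(recognize([0.9, 0.8], vocabulary))
```

The cost is linear in the vocabulary size and in the vector length,
which is consistent with the modest recognition times reported above.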

Changes of the internal representation of the word spoken are
accomplished by the learning algorithms based on the theory previously
described. Specifically, this entails modifying each component of the
representation vector V and its associated strength vector ($S^2$ or L).

The programming requirements of the two models are quite minimal. 
The programs are written in SAIL (Stanford Artificial Intelligence 
Language), which is a superset of ALGOL. The full curriculum of
experiment 2B occupies, when running, only about 35K of core memory
including the child's state vector (see Secs. 4 and 5), the recognition
algorithms, and the audio output routines.

The production of spoken output is presently accomplished by re- 
trieving digitized representations of the words stored on magnetic disk 
and by software regeneration of the analog signal. Again this audio 
process is executed in real time. Consequently, the interchange between
student and computer is sufficiently fluent for smooth verbal
communication with the educational curriculum.

3. EXPERIMENT 1

3.1 Description

As a first quick test of the models, two highly confused
utterances, the letters B and D, were chosen. Fifty utterances of each
letter were spoken into a high quality crystal microphone and recorded
on disk in their digital form, after having passed through our hardware
filters. These utterances were then cycled 10 times, in their original
order, through the delta and beta models.
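The cycling procedure can be pictured with the short Python sketch
below. It is a simplification under stated assumptions: only the theta
process is used (unweighted squared Euclidean distance), each
representation is initialized to the first-heard utterance of its word,
that initializing presentation is not scored, and V is updated only
after a misclassification. The name run_cycles and the toy data are
illustrative and are not the recorded B and D utterances.

```python
def run_cycles(samples, theta, cycles=10):
    """Cycle labeled utterances through a theta-only learner; return percent correct.

    samples is a list of (label, vector) pairs in presentation order.
    """
    representations = {}
    correct = total = 0
    for _ in range(cycles):
        for label, u in samples:
            if label not in representations:
                representations[label] = list(u)   # first-heard utterance initializes V
                continue
            guess = min(
                representations,
                key=lambda w: sum((a - b) ** 2 for a, b in zip(u, representations[w])),
            )
            total += 1
            if guess == label:
                correct += 1
            else:
                # Error discovered: move the correct V toward U (Equation 1).
                representations[label] = [
                    (1.0 - theta) * v + theta * x
                    for v, x in zip(representations[label], u)
                ]
    return 100.0 * correct / total if total else 0.0


# Toy, well-separated data (hypothetical values).
toy = [("B", [0.0, 0.0]), ("D", [10.0, 10.0]), ("B", [0.5, 0.0]), ("D", [9.5, 10.0])]
print(run_cycles(toy, theta=0.1, cycles=2))
```

Running the same loop over a grid of parameter values yields a table of
percentages of correct classification of the kind reported in this
experiment.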

3.2 Delta Model

The parameters $\theta$ and $\delta$ for the delta model ranged in the
intervals [0.1, 1.0] and [0.1, 0.4], respectively. Larger intervals
were not used as the basic structure of the delta model was revealed in
this range. Table 2 gives the results of the percentage of correct
classifications (PCC) for the grid space. Note that under these
somewhat artificial conditions the delta model performed well with a
regular structure and a recognition rate of 96 percent at $\theta$=0.1
and $\delta$=0.1.

Insert Table 2 about here

3.3 Beta Model 

Table 3 shows results of the beta model using the same data. Note,
at least in the preliminary test, a somewhat poorer performance (8?
percent at $\theta$=0.1 and $\beta$=1.1) with less regularity of
structure than the delta model.

Insert Table 3 about here 

3.4 Theta Process

A different, but similar, set of data (50 utterances each of B and
D) was used to examine the theta process by itself. Again the data were
cycled 10 times using the beta model with $\beta$ set to one (i.e., no
change of strengths), and allowing $\theta$ to vary from 0 to .1 in
steps of .01 and from .1 to 1 in steps of .1. The form of the curve in
Figure 3 and the occurrence of the minimum at .045 for $\theta$, after
50 distinct trials, correspond closely to the prediction of Figure 1
for the minimum distance to the representation vector; however, the
similarity of error rate and expected distance is blurred by the fact
that the 50 distinct utterances were presented ten times to the learning
model. It can be considered, however, that each cycle is a sequence of
50 distinct utterances differing only in the starting configuration.

Insert Figure 3 about here

TABLE 2

Percentage of Correct Classifications,
Delta Model

Theta
 .40 |  61   58   53    +    +    +    +    +    +    +
 .30 |  70   65   65   59   56   58    +   55   53   57
 .20 |  80   75   68   66   67   63   52   56    +   57
 .10 |  96   95   83   78   75   71   61   45   37   57
       .10  .20  .30  .40  .50  .60  .70  .80  .90 1.00

                          Delta ->

+ not processed.

TABLE 3

Percentage of Correct Classifications,
Beta Model

                          Beta
       1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0
Theta
1.00

This preliminary experiment shows promise for the learning-model
approach to speech recognition (delta model 96 percent) and indicates
that the theta process is amenable to relatively simple analysis (error
rate and expected distance similarity).

4. EXPERIMENT 2, PART A

4.1 Description

In an effort to provide a practical test of the two models under
actual operating conditions of telephone transmission and reception, we
designed and executed an experiment of two parts (A and B). In A we
acquired a data base of 14 children's voices spoken over the local Palo
Alto telephone system. The telephone arrangement, described in Section
2, entailed calling a local Palo Alto number connected to the Institute
from a university extension. The children, 3 girls and 11 boys, ranged
in age from 6 to 13 years. A 14-word vocabulary (consisting of the
digits 0-9 and command words yes, no, repeat, and stop) was chosen for
compatibility with an elementary mathematics curriculum, Dial-A-Drill
(Computer Curriculum Corp., 1971). In the experiment the vocabulary was
presented sequentially on a cathode-ray tube terminal and was repeated
by the child into the telephone for a total of 11 repetitions of each
word. The time and amplitude normalized form of each utterance was
recorded on magnetic disk.

For the analysis, the data were sequentially presented to the beta
and delta learning models in a machine representation of actual speaking
conditions. A parameter grid space was generated for each model and
examined to determine the optimal parameter settings.

4.2 Learning Curves of Correct Classification

We can represent the results of this experiment by learning curves
for both the delta and beta models. To form each learning curve, we
combined each of the 14 subjects and their 14 responses per trial to
form a learning curve with ten points, with each point representing 196
subject-items on that particular trial. Since we have 11 repetitions of
the vocabulary for each child, each learning curve has ten data points.
As an illustration that actual machine learning is taking place, we
examine these curves in the context of mathematical learning theory. In
order to avoid imposing a specific 'external model' on the learning
process, we examine the mean learning curve, since the same mean
learning curve can be generated from a wide variety of models. When we
define the asymptotic response probability P(correct) = $\pi$ (as n goes
to infinity), a 'guessing' parameter $p_0$, and a learning parameter
0<X<1, we obtain the mean learning curve

$$ P(\text{correct on trial } n) = \pi - (\pi - p_0)\,X^{\,n-1}. \tag{11} $$
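Equation 11 is easy to evaluate directly. The following Python sketch
tabulates the curve for ten trials; the parameter values are the
delta-model estimates quoted in Section 4.3 (asymptote .923, guessing
parameter .477, X = .61), and the function name is an illustrative
assumption.

```python
def mean_learning_curve(n, pi, p0, x):
    """Equation 11: probability of a correct response on trial n (n = 1, 2, ...)."""
    return pi - (pi - p0) * x ** (n - 1)


# Tabulate the curve for trials 1 through 10.
for trial in range(1, 11):
    print(trial, round(mean_learning_curve(trial, 0.923, 0.477, 0.61), 3))
```

The curve starts at the guessing parameter on trial 1 and rises
geometrically toward the asymptote, at a rate governed by X.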

In Figure 4, we see that the learning curve for the delta model
attains a value of about 95 percent correct responses. The deviations
of recognition rates from the average across children are indicated by
the ± one standard deviation error bars for each trial number. The
shape of the learning curve indicates that at least five repetitions of
the vocabulary are necessary for a high recognition rate. As shown in
Figure 5, the learning curve for the beta model reaches 91 percent
correct responses, which is lower than the delta model.

Insert Figure 4 about here



Insert Figure 5 about here 



4.3 Regression on the Learning Curves 

Since we are considering the general mean learning curve
independent of a specific learning model, we will not estimate the
parameters of the curve in the conventional manner, using maximum
likelihood estimators or other statistical techniques based on
predictions of the particular learning model (Atkinson et al., 1965).
Instead we approach the problem of parameter estimation as a regression
problem, with the mean learning curve of the form $Y = b_1 + b_2 Z$
where $b_1 = \pi$, $b_2 = p_0 - \pi$ and $Z = X^{\,n-1}$. Regression
analyses for many values of X were performed on the learning curve data
and the best fits, as determined by the $R^2$ and standard error of
estimate statistics, were used to estimate the parameters ($\pi$, $p_0$)
of the mean learning curve.
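The regression procedure described above can be sketched in Python:
for each candidate X, fit Y = b1 + b2 Z by ordinary least squares with
Z = X^(n-1), then keep the X with the largest R squared. The function
names and the synthetic data below are illustrative assumptions; the
data are generated from Equation 11 and are not the learning-curve data
of the experiment.

```python
def fit_for_x(y, x):
    """Least-squares fit of Y = b1 + b2 * Z with Z = x**(n-1), n = 1..len(y)."""
    count = len(y)
    z = [x ** k for k in range(count)]
    z_mean = sum(z) / count
    y_mean = sum(y) / count
    s_zz = sum((zi - z_mean) ** 2 for zi in z)
    s_zy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    b2 = s_zy / s_zz
    b1 = y_mean - b2 * z_mean
    ss_res = sum((yi - (b1 + b2 * zi)) ** 2 for zi, yi in zip(z, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    # b1 estimates the asymptote pi; b1 + b2 estimates the guessing parameter p0.
    return {"pi": b1, "p0": b1 + b2, "r2": 1.0 - ss_res / ss_tot}


def best_fit(y, grid):
    """Scan candidate X values (0 < X < 1) and return (X, fit) with the largest R^2."""
    return max(((x, fit_for_x(y, x)) for x in grid), key=lambda item: item[1]["r2"])


# Synthetic ten-point curve with pi = 0.9, p0 = 0.5, X = 0.6 (hypothetical values).
y = [0.9 - (0.9 - 0.5) * 0.6 ** k for k in range(10)]
x_best, fit = best_fit(y, [k / 10 for k in range(1, 10)])
print(x_best, fit)
```

Fixing X makes the model linear in b1 and b2, which is why a simple
scan over X can stand in for a nonlinear fit.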

For the delta model the maximum $R^2$ statistic was .976 with an
associated F value of 326 testing the statistical significance of the
regression coefficients, and a standard error of estimate (s.e.e.) of
.024. For X=.61 the parameter estimates $\pi$, $p_0$ were .923 and .477
respectively. Also, for X=.771, $R^2$=.937 and s.e.e.=.039 with F value
119. Here the parameter estimates were $\pi$=.9994 and $p_0$=.53. Thus,
the regression analysis of the delta model learning curve yields an
asymptotic recognition rate above 92 percent with a 100 percent
asymptotic recognition rate also giving a good fit to the learning
curve.

For the beta model learning curve, similar analysis gave the
greatest $R^2$=.969 with s.e.e.=.02 and F=247. Here X=.?2, $\pi$=.914
and $p_0$=.572. The largest $\pi$ was .94 with $p_0$=.392, X=.?0,
$R^2$=.80, s.e.e.=.025, and F=159. Again asymptotic recognition rates
above 90 percent were found with the best fit at 91.4 percent and the
maximum asymptotic rate of 94 percent with significant F values.
4.4 Parameter Grid Spaces

For simplicity in computation and discussion, we use the average of
the ninth and tenth trials as the asymptotic approximation, although one
cannot be certain that asymptote has been reached by the tenth trial.
In our discussion we consider two different asymptotic maxima, the group
asymptotic maximum displayed in the learning curves and parameter grid
spaces and the individual asymptotic maxima shown in the later figures.
The group asymptotic maxima are obtained by averaging over the subjects'
individual asymptotic recognition rates for each grid point and
selecting the maximum, while the individual maxima are simply the best
asymptotic recognition rates for each child in his parameter space. The
grid points for individual maxima may or may not coincide with the
points for the group maximum. We use the group asymptotic value as our
recognition rate, although the mean of the individual maxima is greater,
in recognition of the importance of a single parameter setting
generalizable across children.
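The distinction between the group maximum and the individual maxima can
be made concrete with a small Python sketch. The nested-dictionary
layout (subject, then grid point, then asymptotic rate), the function
names, and the two-subject numbers are all hypothetical and serve only
to illustrate the two definitions.

```python
def group_maximum(rates):
    """Average the subjects' rates at each grid point and return the best point."""
    points = next(iter(rates.values())).keys()
    count = len(rates)
    means = {p: sum(r[p] for r in rates.values()) / count for p in points}
    best = max(means, key=means.get)
    return best, means[best]


def individual_maxima(rates):
    """For each subject, return the (grid point, rate) pair with the highest rate."""
    return {s: max(r.items(), key=lambda kv: kv[1]) for s, r in rates.items()}


# Hypothetical asymptotic rates (percent) for two subjects on two grid points.
rates = {
    "a": {(0.4, 0.1): 90.0, (0.3, 0.1): 80.0},
    "b": {(0.4, 0.1): 80.0, (0.3, 0.1): 86.0},
}
print(group_maximum(rates))
print(individual_maxima(rates))
```

The mean of the individual maxima can never fall below the group
maximum, since each subject's best rate is at least that subject's rate
at the group's grid point.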

To further illustrate the structure of the delta model, consider
Table 4, which shows the percentage of asymptotic correct
classifications averaged over the 14 subjects as a function of the
parameters $\theta$ and $\delta$. The parameter space displays a
definite and regular structure for the delta model. The group maximum
is 94.1 percent at grid point $\theta$=.4, $\delta$=.1.

Insert Table 4 about here

Similarly, we display in Table 5 the structure of the asymptotic
percentage of correct classification over a grid of parameter settings
for the beta model. The parameter space for the beta model is
noticeably flat even out to values of $\beta$=5.0. The group maximum is
89.8 percent at grid point (0.3, 2.2).

Insert Table 5 about here

TABLE 4

Asymptotic Percentage of Correct Classifications,
Delta Model

4.5 Theta Process 

Again we consider the theta process alone. Figure 6 gives the
error rate from column one of Table 4 ($\delta$=0). We note that to the
accuracy of the curve the minimum occurs at the same value of theta that
Table 1 predicts for the minimum of the expected distance of an
utterance vector to its representation vector on the tenth trial. This
striking correspondence between the minima of the error rate and the
minima of the expected distance lends strength to analysis in terms of
distances. Note that this analysis holds for two dissimilar situations,
experiment 1 with a 2-word vocabulary and experiment 2 with a 14-word
vocabulary.

Insert Figure 6 about here

4.6 Individual Asymptotic Maxima

So far we have been considering the group asymptotic maximum using
one grid point for all subjects. This is important from an operational
point of view, since when dealing with many children in a CAI curriculum
it would be useful to have a general parameter setting good for all
students. In Figures 7 and 8 and Tables 6, 7, and 8 we examine
distributions of individual maxima over the parameter space in a further
comparison of the beta and delta models. We see in Figure 7 a 3 percent
overall improvement for the beta model with individual improvements of
as much as 13 percent for one subject when individual maxima are used
instead of the group maximum. For the delta model the improvement was
only 1.6 percent with the largest individual improvement being 3.6
percent. As can be seen from Tables 4 and 5 the delta model displays
more regularity of structure about its group maximum than the beta model
does.

Insert Figure 7 about here

TABLE 5

Asymptotic Percentage of Correct Classifications,
Beta Model

4.7 Comparison of Individual Maxima for Beta and Delta Models

Note in Figure 8 the delta model does as well or better than the
beta model in every case but one in this comparison. If we compared the
individual asymptotes at the group maximum the delta superiority would
be even greater. Hence, from these data from 14 children, we conclude
that the delta model produces better recognition than the beta model.

Insert Figure 8 about here

4.8 Distribution of Optimal Parameter Settings

The distribution of optimal settings in the parameter space for the
two models is displayed in Tables 6 and 7. For the delta model the
asymptotic PCC for the grid point $\theta$=.4, $\delta$=.1 is
consistently close to the individual maximum value for all subjects.
Nine subjects attained individual asymptotic maxima at this grid point,
the group asymptotic maximum. For the five subjects who had different
individual asymptotic maxima, the difference between their maximum
recognition rate and their recognition rate for the grid point
$\theta$=.4, $\delta$=.1 is only 3.6 percent for each subject. The beta
model (Table 7) again shows more range and less definite structure than
the delta model with 8 of 14 subjects having individual maxima distinct
from the group maximum.

Insert Table 6 about here

Insert Table 7 about here

TABLE 6

Distribution of Optimal Parameter Settings,
Delta Model

                          Delta
0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0



4.9 Age Dependency of Recognition Rate

From the results in Table 8 we can determine almost no age depen- 
dence for the recognition rates of children in the age range of 6 to 13 
years old. 

Insert Table 8 about here 



5. EXPERIMENT 2, PART B

5.1 Description

The follow-up experiment was designed to determine whether or not
the results of 2A had valid correspondence to actual working conditions
of real-time recognition in a child's learning situation. It entailed
investigating recognition rates for a telephone CAI mathematics
curriculum with audio output based on Dial-A-Drill (Computer Curriculum
Corp., 1971), which incorporates the delta model into the learning
scheme. The system presently runs in a fully automatic mode in that a
telephone call placed to (415) 321-TALK is automatically answered by a
coupler that connects the outside phone line to an Institute high-speed
data line (HSL10). The TENEX executive program of the PDP-10 has been
modified so that upon sensing the input data over HSL10 it will log in a
job and run a prespecified program.

TABLE 7

Distribution of Optimal Parameter Settings,
Beta Model

TABLE 8

Distribution of Individual Maximum Asymptotic PCC
as a Function of Age

(Two panels, Delta Model and Beta Model; horizontal axis: age of
subjects, 6 to 13 years.)

The mathematics curriculum is extremely simple. It uses only the
14-word vocabulary described in the previous experiment, and randomly
generates five types of questions (sum, difference, product, integer
division, and ratios). The computer-student dialogue, consisting of
these arithmetic questions, produces audio output that is accessed
quickly from magnetic disk and transmitted to the student over the
telephone.



5.2 Dial-A-Drill Curriculum Structure

The curriculum structure is indicated in Figure 9. A typical
segment of the dialogue could consist of the following:

Insert Figure 9 about here

COMPUTER: WHAT IS 7 MINUS 3 ?
STUDENT:  4
COMPUTER: YUP, THAT'S RIGHT.                 Path ending at 1
                                             (see tree)

COMPUTER: 6 OVER 3 IS WHAT ?
STUDENT:  2
COMPUTER: DID YOU SAY 0 ?
STUDENT:  no
COMPUTER: SORRY, MY MISTAKE. PLEASE SAY 2.
STUDENT:  2
COMPUTER: THANK YOU!                         Path ending at 2

COMPUTER: 5 TIMES 0 IS WHAT ?
STUDENT:  5
COMPUTER: DID YOU SAY 5 ?
STUDENT:  yes
COMPUTER: TOO BAD, 5 TIMES 0 IS 0.           Path ending at 3



[Figure 9: tree diagram. Recoverable node labels: "Question"; "That's
right"; "Too bad"; "nope"; "my mistake, please say *"; (learning);
(incorrect learning); paths ending at 1, 2, and 3.]

SYMBOLS

 A    Correct answer given by child
-A    Incorrect answer given by child
 R    Recognized by computer
-R    Misrecognized by computer
 +    Node
+n    Node with "no" response always
+y    Node with "yes" response always
 Y    Computer thinks yes was said
 N    Computer thinks no was said
 *    Correct answer (to be repeated by student)

Fig. 9. Tree diagram for learning in the mathematics curriculum.
Each of the above three dialogues can be represented as a path along the
learning tree shown above. We use noncontingent learning for the delta
model on all correct responses and also update the representation vector
on all requested repetitions. The one possible incorrect learning node
on the tree was not realized in practice.

Seven subjects from Part A each answered 100 mathematics exercises 
from the curriculum. The recognition mechanism was loaded with a state 
vector for each subject obtained from the data of Experiment 2A using 
the delta model at the optimal parameter settings. 

5.3 Comparison of Parts A and B 

The resulting recognition rates for the telephone curriculum are
shown in Figure 10 and average 13 percent below the best recognition
rates for the subjects in Experiment 2A. The decrease in recognition
rates in Part B can be accounted for by educational and psychological
factors. We did not have an introductory session to acquaint the child
with the system. Also, in an effort to approximate natural home
conditions we gave no instructions to the child about speaking
carefully. When faced with a mathematical question instead of a mere
request to repeat a number the student sometimes stammered or changed
his mind in the midst of an utterance (e.g., "ONE - NO - TWO!"), which
had obvious degrading effects on the recognition rate. Observe that even
under these conditions the recognition rates are all above 75 percent.

Insert Figure 10 about here
Fig. 10. Comparison of recognition rates on experiments 2A and 2B
for the seven subjects completing both experiments.



5.4 Confusion Matrix for Experiment 2B

In Table 9 we present a confusion matrix of the numbers 0-9 for the
seven subjects in experiment 2B. Each element of the confusion matrix
($c_{ij}$) represents the number of events where the utterance was i and
the classification was j. Thus the matrix entry $c_{2,0}$ indicates the
number of events where 2 was said and the computer misclassified the
utterance as 0.

Insert Table 9 about here
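The tallying behind such a confusion matrix is straightforward; a
minimal Python sketch follows. The event list is hypothetical and the
function name is an illustrative assumption.

```python
def confusion_matrix(events, size=10):
    """Count events: c[i][j] is the number of times digit i was said and j was chosen."""
    c = [[0] * size for _ in range(size)]
    for spoken, classified in events:
        c[spoken][classified] += 1
    return c


# Hypothetical (spoken, classified) pairs for the digits 0-9.
events = [(2, 0), (2, 2), (2, 0), (7, 7)]
c = confusion_matrix(events)
print(c[2][0])  # number of times 2 was said and classified as 0
```

Diagonal entries count correct recognitions, so the recognition rate
for digit i is c[i][i] divided by the sum of row i.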

6. SUMMARY AND CONCLUSIONS 

We have constructed and tested two models of learning processes for 
the purpose of computer recognition of human speech over the telephone. 
The delta model was found superior to the beta model in all comparisons. 
For the delta model a regression analysis on the learning curve yielded
a 92.3 percent recognition rate for 14 subjects ranging in age from 6 to
13 years. When the individual approximate asymptotic maxima are used,
the recognition rate climbs to 95.7 percent. All the recognition was
done using a standard home telephone.

It should again be emphasized that we are conducting real-time
recognition in a time-sharing environment without any linguistic
restrictions and with relative computational simplicity. Consequently, the
system can be used on any language from Swahili to English.

We also tested the recognition system on an elementary mathematics
curriculum conducted entirely over the telephone in Experiment 2B. From
our observations we found that the children seemed quite tolerant of
nonperfect recognition and, indeed, were amused when the computer made
a mistake.

In our efforts we are approaching speech recognition from the direction
of machine learning. In analyzing Experiment 2A we see that
learning indeed occurs and is amenable to theoretical analysis for the 
purpose of predicting the learning performance from the structure of the 
model. Future efforts will be directed toward deriving the exact form 
of this performance and toward making deeper comparisons with human 
learning theory.

Previous Speech-Recognition Work at IMSSS

Camille Bellissant 
Stanford University

1. INTRODUCTION

The aim of this work was to run some preliminary experiments using 
audio for both input and output in a computer-assisted instruction (CAI) 
program. 

The output part, i.e., speech production, was handled by an exis- 
ting program that gives good results for short sentences. The produc- 
tion is not done by synthesizing but by digitizing spoken words which 
are later concatenated to produce a sentence. The random access to the 
digitized records on the disk allows quick retrieval and, when the
computer is not overloaded, permits a continuous audio output.

For the input part, i.e., speech recognition, it was decided to
begin by adopting the system developed by Raj Reddy and Pierre Vicens at
the Artificial Intelligence Project, Stanford University (Vicens, 1970).
This choice was justified by the effectiveness of the system for
recognizing isolated words belonging to a small vocabulary (about 50 words,
which is large enough for our purpose).

In the next section we describe the Vicens program and our
modifications of it.



2. THE MODIFIED VICENS PROGRAM



In his thesis, Vicens (Vicens, 1969) presents the techniques and
methodology he used in building the system.

2.1 Preprocessing.

The audio message to be recognized is first preprocessed by hardware.
Three filters (150-900 Hz, 900-2200 Hz, 2200-5000 Hz), corresponding
roughly to the first three formants of voice, and an analog-to-digital
converter produce for each frequency band and for each sample of
10 ms of sound the maximum amplitude (peak to peak) and the number of
zero-crossings of the amplitude-time function. The data are transmitted
through a high-speed line to the software preprocessor, which normalizes
the amplitudes.
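
As an illustration of the two parameters computed per 10-ms sample, the
following Python sketch derives the peak-to-peak amplitude and the number
of zero-crossings from a list of signal values for one frequency band. In
the system described here these quantities are produced by hardware; the
function names, the sampling rate argument, and the normalization scheme
(scaling by the largest amplitude in the utterance) are assumptions made
only for this example.

```python
def frame_features(samples, sample_rate_hz, frame_ms=10):
    """Return (peak_to_peak, zero_crossings) for each full frame of one band."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        peak_to_peak = max(frame) - min(frame)
        # Count sign changes between consecutive values in the frame.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((peak_to_peak, crossings))
    return features


def normalize_amplitudes(features):
    """Scale amplitudes so the largest becomes 1.0 (an assumed scheme)."""
    peak = max((p for p, _ in features), default=0)
    if peak == 0:
        return [(0.0, z) for _, z in features]
    return [(p / peak, z) for p, z in features]


# Toy signal at an assumed 1000 Hz rate: one oscillating frame, one flat frame.
signal = [1, -1] * 5 + [3] * 10
features = frame_features(signal, sample_rate_hz=1000)
normalized = normalize_amplitudes(features)
```

The first frame alternates in sign and so has a large zero-crossing count,
while the flat second frame has none.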

2.2 Segmentation. 

After the hardware and software preprocessing, the data are treated 
by the segmentation procedure. This consists of grouping the minimal 
segments of 10 ms into wider segments presenting roughly the same
acoustic characteristics (sustained segments) and isolating the others into
transitional segments. 

Some errors can occur in this grouping, so a secondary
segmentation procedure corrects the possible errors by looking at the
variation of parameters in the sustained segments and at the local
maxima and minima of the amplitude parameters in the transitional segments.
If the variation of parameters in a sustained segment exceeds a certain
limit, or if a transitional segment presents a local extremum, the
segment is divided into smaller ones.

The last part of the segmentation is the combining process, whose
purpose is to group together acoustically similar secondary segments.
The sustained segments are extended onto the transitional segments if
the parameters are too different.
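
The primary grouping step can be pictured with the following Python
sketch. It is not the Vicens procedure: the tolerance, the minimum run
length, and the rule of comparing each frame with the first frame of its
run are all assumptions chosen to keep the example short. The secondary
segmentation and the combining process are not modeled.

```python
def primary_segmentation(frames, tol=0.2, min_len=3):
    """Group frames into (kind, start, end) runs; end is exclusive.

    frames: list of tuples of normalized parameters, one per 10-ms sample.
    kind is 'sustained' for runs of similar frames, else 'transitional'.
    """
    segments = []
    i = 0
    while i < len(frames):
        j = i + 1
        # Extend the run while every parameter stays close to the first frame.
        while j < len(frames) and all(
            abs(a - b) <= tol for a, b in zip(frames[i], frames[j])
        ):
            j += 1
        kind = "sustained" if j - i >= min_len else "transitional"
        if kind == "transitional" and segments and segments[-1][0] == kind:
            # Merge adjacent transitional runs into a single segment.
            segments[-1] = (kind, segments[-1][1], j)
        else:
            segments.append((kind, i, j))
        i = j
    return segments


# Hypothetical normalized (amplitude, zero-crossing) frames.
frames = [(1.0, 0.1)] * 4 + [(0.6, 0.4), (0.3, 0.7)] + [(0.0, 1.0)] * 3
segments = primary_segmentation(frames)
```

With these toy frames the result is a sustained segment, a two-frame
transitional segment, and a second sustained segment.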

2.3 Classification. 

After the segmentation process, most of the transitional segments, 
which do not contain pertinent information, are eliminated. The purpose 
of classification is to assign linguistic labels to the sustained seg- 
ments. The phoneme groups are fricative, vowel, stop, consonant, nasal, 
and burst. The vowels are subclassified into nine categories with
respect to their zero-crossing parameters. The discrimination into
phoneme groups is accomplished by comparing the amplitude and zero-crossing
parameters of the segments with known values in acoustic phonetics. 

The results of the previous processes are summarized in an internal 
representation of the speech utterance that is used for all the storing, 
retrieving and matching processes. 

2.4 Recognition.

The recognition of words is accomplished by retrieval of previously
learned messages. This learning consists of reducing the internal
representation of the speech utterance and storing the reduced form in a
dictionary. The size of such a reduced record is about 1000 bits of
continuous storage for an average sound of 1 second.

The dictionary is provided with two independent list structures
depending on the phonetic representation of the message (number of
vowels and unvoiced fricatives), and the print name of the message.
During the recognition phase, the dictionary organization allows a
candidate list to be constructed quickly in the following stages:

1. Elimination of all candidates whose relative positions of
vowels and fricatives are different from those of the incoming message.

2. Elimination of all the candidates with strictly different vowel 
zero-crossing characteristics. 

3. Elimination of all the candidates having low vowel similarity
scores obtained by comparison with the incoming message.

The first elimination is obtained directly from the dictionary, 
which holds the relative position of vowels and fricatives for each
recorded utterance. The second elimination is accomplished by using a
table that defines crude dissimilarity values between each pair of 
vowels on the basis of their earlier classification into subcategories. 
At this stage the list of candidates is reordered, so that the most si- 
milar candidates are placed first. 

The third elimination is done by computing a similarity between the
incoming message and all the entries in the candidate list. First, a
segment synchronization procedure is called to create linkages between
the segments of the two representations. The similarity values obtained
for each pair of linked segments are stored for the selection process
that chooses the candidate with the highest similarity coefficient. If
one of the candidates reaches a score greater than or equal to 95
percent, the selection process immediately stops and returns the candidate
print name. Otherwise, each time a good similarity score is obtained
(>80 percent) the candidate list is rearranged in order to place first
all entries having the same print name as that of the present candidate.
When the list is exhausted, the candidate with the best score is chosen
if the similarity score is greater than 75 percent. If none of
the candidates presents such a score, the selection process is
reinitiated with a new list of candidates having small differences in the
phonetic representation (number and relative position of vowels and
unvoiced fricatives). If no candidate can be found in the dictionary, it
means that the incoming message cannot be recognized, and the user is
invited to enter its print name. At this time, the dictionary is
augmented by the representation of the new message, which can be used
afterwards as a possible candidate for a further utterance.
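
The selection process just described can be sketched in Python as
follows. The three thresholds (95, 80, and 75 percent) are taken from the
text; everything else, including the function name, the representation of
a candidate as a (print name, record) pair, and the similarity function
passed in by the caller, is an assumption of this example. The
reinitiation with a relaxed candidate list and the prompt to the user are
left out.

```python
def select_candidate(candidates, similarity, accept=95, promote=80, minimum=75):
    """Return the chosen print name, or None if no score exceeds `minimum`.

    candidates: list of (print_name, record) pairs, most similar first.
    similarity: function mapping a record to a score in percent.
    """
    pending = list(candidates)
    best_name, best_score = None, -1
    while pending:
        name, record = pending.pop(0)
        score = similarity(record)
        if score > best_score:
            best_name, best_score = name, score
        if score >= accept:
            return name  # good enough: stop immediately
        if score > promote:
            # Place first all remaining entries with the same print name.
            # list.sort is stable, so other entries keep their order.
            pending.sort(key=lambda entry: entry[0] != name)
    return best_name if best_score > minimum else None


# Toy run: each record is simply its own precomputed score (hypothetical).
examined = []

def toy_similarity(record):
    examined.append(record)
    return record

chosen = select_candidate(
    [("one", 60), ("two", 85), ("nine", 70), ("two", 96), ("five", 99)],
    toy_similarity,
)
```

In the toy run the score of 85 for "two" moves the other "two" entry to
the front of the list, where its score of 96 ends the search before the
remaining candidates are examined.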

2.5 Modifications.

The original Vicens' program as described above was written in
FORTRAN for the PDP-10 at the Stanford Artificial Intelligence (AI)
Project. First, the program was rewritten for the PDP-10 at the Institute in
SAIL, which is a high-level language that is a superset of ALGOL and
that has been developed at the AI project.

Second, hardware that has performance characteristics very similar
to the hardware on the AI PDP-10 was designed and constructed by Ron
Wizelman of the Institute staff. There is a slight difference in the
handling of the incoming data as given by the hardware. The Vicens'
program was working within a "spacewar" environment in order to impose
priority over other users while listening to the sound. We use a
high-speed line that gives good results at all times for a continuous input.
So that the user need not speak at the instant the program is ready
to listen to him, we have implemented two thresholds. One is hardware,
the other is software. The first is a simple potentiometer that
inhibits the hardware equipment as long as the amplitudes are under a
certain value. This value is adjustable and can be set as low as zero.
As soon as this threshold value is exceeded, the hardware begins to
transmit data to the program and keeps transmitting even after the
amplitudes again drop under the threshold value. This delay, which is
also adjustable, is necessary to allow small silences in the utterance
without interrupting the transmission. Our first experiments with this
hardware threshold have shown some loss of data at the very beginning of
each utterance, due to the positive fixed-value threshold.

In order to avoid this loss of data, we experimented with kicking
the microphone, which started the hardware, and speaking just after the
kick. The effect of the kick was eliminated by software and so no
data were lost. Besides the inelegance of such a method, we found it
difficult to apply to all kinds of microphones, especially telephones.
So we introduced the following process. The hardware threshold is set
to zero, so the hardware is always ready to transmit data. The software
procedure reads only three samples of sound (0.03 sec) and computes the
average amplitudes and zero-crossings. If these values are under a
threshold, three new samples are processed, and so on. If the values
are above the threshold, the procedure fills up the input buffer (1.5
sec). The 'tail' of the utterance, i.e., samples with low amplitudes at
the end of the message, is then eliminated so that only the relevant
values are subsequently processed by the segmentation procedure. When
the computer is overloaded, this method (the 'software kick') sometimes
produces a loss of one sample (10 ms), which is actually the smallest
amount of data that can be lost.
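
A minimal Python sketch of the 'software kick' follows. The group size of
three samples and the 1.5-sec buffer (150 samples of 10 ms) come from the
text. The use of two separate thresholds, the rule that either average
exceeding its threshold starts the capture, the inclusion of the three
triggering samples in the buffer, and the function name are assumptions
of this example.

```python
from itertools import islice


def capture_utterance(frames, amp_threshold, zc_threshold,
                      buffer_frames=150, probe=3):
    """Wait for speech, fill the buffer, and trim the low-amplitude tail.

    frames: iterable of (amplitude, zero_crossings), one per 10-ms sample.
    Returns the captured frames, or [] if the input ends in silence.
    """
    stream = iter(frames)
    while True:
        group = list(islice(stream, probe))
        if len(group) < probe:
            return []  # input ended before speech was detected
        mean_amp = sum(a for a, _ in group) / probe
        mean_zc = sum(z for _, z in group) / probe
        if mean_amp >= amp_threshold or mean_zc >= zc_threshold:
            break  # speech detected: stop probing and fill the buffer
    buffer = group + list(islice(stream, buffer_frames - probe))
    # Eliminate the 'tail': low-amplitude samples at the end of the message.
    while buffer and buffer[-1][0] < amp_threshold:
        buffer.pop()
    return buffer


# Hypothetical input: 60 ms of silence, 40 ms of speech, 50 ms of silence.
stream = [(1, 0)] * 6 + [(10, 5)] * 4 + [(1, 0)] * 5
captured = capture_utterance(stream, amp_threshold=5, zc_threshold=3)
```

Because silence is discarded three samples at a time, the leading silence
never enters the buffer, and only the four speech samples remain after the
tail is trimmed.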

The other differences we introduced in the Vicens' program concern
the selection process. First, the value of the similarity threshold (95
in the original program), which is used when one examines the candidate
list, was made adjustable by an interactive command. We are concerned with the
best choice of the threshold value for different sets of words.
Intuitively, the larger the value, the more demanding the system when it
tries to accept a candidate as a proper answer. Sets of words with
large phonetic dissimilarity can be processed with a low threshold and
a consequent saving of time.

The second difference is related to the use of the system in peda- 
gogical experiments in elementary arithmetic. The purpose of these ex- 
periments is to ask the user the results of operations on numbers. In 
this case, for each question there exists one and only one possible 
answer. When the answer is incorrect, we do not try to recognize the
specific value that was uttered. For example, after the question "how 
much is three plus four?" we are only interested in the comparison be- 
tween the uttered answer and 7. If it is not 7 we do not try to know 
whether it was 6, 8, or something else. In this situation, the recog- 
nition process can be considerably accelerated by limiting the candidate 
list to those that have the same print name as the expected answer. We 
found that in this way the answer processing is faster than the time 
spent to utter it, which offers some hope for communication by telephone 
when the nature of the messages to be recognized is well adapted to such 
a discrimination. 
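
The restriction of the candidate list to the expected answer can be
sketched in Python as below. The dictionary layout, the similarity
function supplied by the caller, the acceptance threshold of 75 percent,
and the function name are assumptions of this example and do not describe
the program itself.

```python
def answer_is_correct(utterance, dictionary, expected_name, similarity,
                      minimum=75):
    """True if the utterance matches a stored record of the expected answer.

    dictionary: list of (print_name, record) pairs.
    similarity: function (utterance, record) -> score in percent.
    """
    # Only entries with the expected print name are ever compared.
    candidates = [rec for name, rec in dictionary if name == expected_name]
    return any(similarity(utterance, rec) > minimum for rec in candidates)


# Toy data: records are numbers and similarity falls off with their distance.
compared = []

def toy_similarity(utterance, record):
    compared.append(record)
    return 100 - 10 * abs(utterance - record)

toy_dictionary = [("six", 60), ("seven", 70), ("seven", 72), ("eight", 80)]
right = answer_is_correct(71, toy_dictionary, "seven", toy_similarity)
wrong = answer_is_correct(61, toy_dictionary, "seven", toy_similarity)
```

Only the entries named "seven" are compared with the utterance, so the
cost per question depends on the number of stored versions of the expected
answer and not on the size of the whole vocabulary.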